CHEMDNER system with mixed conditional random fields and multi-scale word clustering
Abstract
BACKGROUND Chemical compound and drug name recognition plays an important role in chemical text mining and is the basis for automatic relation extraction and event identification in chemical information processing. A high-performance named entity recognition system for chemical compound and drug names is therefore necessary. METHODS We developed a CHEMDNER system based on mixed conditional random fields (CRF) with word clustering for chemical compound and drug name recognition. For the word clustering, we used Brown's hierarchical algorithm and the deep-learning-based Skip-gram model, applied to a large collection of PubMed articles (titles and abstracts). RESULTS The system achieved the highest F-score, 88.20%, for the CDI (chemical document indexing) task and the second-highest F-score, 87.11%, for the CEM (chemical entity mention) task in BioCreative IV. Multi-scale clustering based on deep learning further improved performance, to an F-score of 88.71% for CDI and 88.06% for CEM. CONCLUSIONS The mixed CRF model represents both the internal complexity and the external contexts of the entities, and it is integrated with word clustering to capture domain knowledge from PubMed titles and abstracts. This domain knowledge helps to ensure entity recognition performance even without fine-grained linguistic features or manually designed rules.
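The abstract does not say how the word clusters enter the CRF as features. A common approach with hierarchical Brown clusters, sketched here purely as an illustration, is to use prefixes of each word's cluster bit string at several lengths, so that the model sees both coarse and fine groupings. The cluster table, the prefix lengths, and the function names below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence

# Hypothetical toy table: word -> Brown cluster bit-string path.
# A real table would be produced by running Brown clustering on a corpus.
BROWN_PATHS: Dict[str, str] = {
    "aspirin": "110100",
    "ibuprofen": "110101",
    "inhibits": "0111",
}


def cluster_features(word: str, paths: Dict[str, str],
                     prefix_lengths: Sequence[int] = (2, 4, 6)) -> List[str]:
    """Return cluster-prefix features for one word at several granularities."""
    path = paths.get(word.lower())
    if path is None:
        return ["brown=UNK"]  # word not covered by the clustering
    # Shorter prefixes give coarser clusters; longer prefixes give finer ones.
    return [f"brown{n}={path[:n]}" for n in prefix_lengths]


def token_features(tokens: Sequence[str], i: int,
                   paths: Dict[str, str]) -> List[str]:
    """Build a feature list for token i, including its neighbours' clusters."""
    feats = [f"w={tokens[i].lower()}"]
    feats += cluster_features(tokens[i], paths)
    if i > 0:
        feats += ["prev:" + f for f in cluster_features(tokens[i - 1], paths)]
    if i < len(tokens) - 1:
        feats += ["next:" + f for f in cluster_features(tokens[i + 1], paths)]
    return feats


if __name__ == "__main__":
    sentence = ["Aspirin", "inhibits", "COX"]
    for idx in range(len(sentence)):
        print(sentence[idx], token_features(sentence, idx, BROWN_PATHS))
```

In this toy table, "aspirin" and "ibuprofen" share the 4-bit prefix `1101`, so a CRF trained on one can generalise to the other through the shared `brown4` feature even when the word itself was never seen in the annotated data.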
Similar resources
WHU-BioNLP CHEMDNER System with Mixed Conditional Random Fields and Word Clustering
Our team participated in the Chemical Compound and Drug Name Recognition task of BioCreative IV. We used mixed conditional random fields with word clustering to fulfill this task. On the one hand, we generate word features by word clustering and train on the corpus with these features to obtain one model. On the other hand, the training corpus is transformed to a new one in the reversed order of ...
A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature
BACKGROUND Chemical compounds and drugs (together called chemical entities) embedded in scientific articles are crucial for many information extraction tasks in the biomedical domain. However, only a very limited number of chemical entity recognition systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of chemical entity r...
Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations
BACKGROUND Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to leverage system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. We present a semi-supervised learning m...
BANNER-CHEMDNER: Incorporating Domain Knowledge in Chemical and Drug Named Entity Recognition
Exploiting unlabeled text data to leverage the system performance has been an active and challenging research topic in text mining, due to the recent growth of the amount of biomedical literature. Named entity recognition is an essential prerequisite task before effective text mining of biomedical literature can begin. The participants of the CHEMDNER task of the BioCreative IV challenge are as...
Identifying chemical entities in patents using brown clustering and semantic similarity
This paper presents the system we developed for the CHEMDNER task of BioCreative V. This system was adapted from the IICE framework, which combines Conditional Random Fields, implemented by Stanford NER; Brown clustering, implemented by Percy Liang's C++-based algorithm; and a semantic similarity measure based on the h-index concept, applied to the ChEBI ontology. For the CEMP subtask, we obtained a maximu...